.. _`Text Count Vectorizer`:

.. _`org.sysess.sympathy.machinelearning.count_vectorizer`:

Text Count Vectorizer
`````````````````````

.. image:: count_vectorizer.svg
   :width: 48


Convert a collection of text documents to a matrix of token counts


Documentation
:::::::::::::

Attributes
==========

    **stop_words_**


    **vocabulary_**
        A mapping of terms to feature indices.


Definition
::::::::::


Output ports
============

    **model**  model
        Model


Configuration
=============

    **Analyzer** (analyzer)
        Whether the feature should be made of word n-gram or character
        n-grams.
        Option 'char_wb' creates character n-grams only from text inside
        word boundaries; n-grams at the edges of words are padded with space.

        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

        .. versionchanged:: 0.21

        Since v0.21, if ``input`` is ``filename`` or ``file``, the data is
        first read from the file and then passed to the given callable
        analyzer.
    **Binary** (binary)
        If True, all non zero counts are set to 1. This is useful for discrete
        probabilistic models that model binary events rather than integer
        counts.
    **Decoding error behavior** (decode_error)
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.
    **Encoding** (encoding)
        If bytes or files are given to analyze, this encoding is used to
        decode.
    **Lowercase** (lowercase)
        Convert all characters to lowercase before tokenizing.
    **Maximum document frequency** (max_df)
        When building the vocabulary ignore terms that have a document
        frequency strictly higher than the given threshold (corpus-specific
        stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.
    **Maximum features** (max_features)
        If not None, build a vocabulary that only consider the top
        `max_features` ordered by term frequency across the corpus.
        Otherwise, all features are used.

        This parameter is ignored if vocabulary is not None.
    **Minimum document frequency** (min_df)
        When building the vocabulary ignore terms that have a document
        frequency strictly lower than the given threshold. This value is also
        called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.
    **N-gram range** (ngram_range)
        The lower and upper boundary of the range of n-values for different
        word n-grams or char n-grams to be extracted. All values of n such
        such that min_n <= n <= max_n will be used. For example an
        ``ngram_range`` of ``(1, 1)`` means only unigrams, ``(1, 2)`` means
        unigrams and bigrams, and ``(2, 2)`` means only bigrams.
        Only applies if ``analyzer`` is not callable.
    **Stop words** (stop_words)
        If 'english', a built-in stop word list for English is used.
        There are several known issues with 'english' and you should
        consider an alternative (see stop_words).

        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.
        Only applies if ``analyzer == 'word'``.

        If None, no stop words will be used. In this case, setting `max_df`
        to a higher value, such as in the range (0.7, 1.0), can automatically detect
        and filter stop words based on intra corpus document frequency of terms.
    **Strip accents** (strip_accents)
        Remove accents and perform other character normalization
        during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        a direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) means no character normalization is performed.

        Both 'ascii' and 'unicode' use NFKD normalization from
        :func:`unicodedata.normalize`.


Implementation
==============

.. automodule:: node_text
    :noindex:

.. class:: CountVectorizer
    :noindex: